
Table 5.1: Label function analysis for the Address label

Label function             j  Polarity         Coverage  Overlaps  Conflicts  Acc
lf_address_extract_model   0  [0, 1, 2, 3, 4]  0.3357    0.3185    0.0172     0.98
lf_in_dict                 1  [0, 1, 2, 3, 4]  0.2435    0.138     0.1055     0.76
lf_pre_key                 2  [0, 1, 2, 3, 4]  0.098     0.075     0.023      0.88
lf_pre_thanh_pho_is_tinh   3  [3]              0.0012    0.0008    0.0004     0.36
lf_pre_thanh_pho_is_quan   4  [2]              0.0054    0.0032    0.0022     0.14
lf_pre_is_quan             5  [3]              0.0025    0.00017   0.00233    0.47
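As a reading aid for the Coverage, Overlaps and Conflicts columns of Table 5.1, the following is a minimal sketch of how such per-label-function statistics are commonly computed from a label matrix. It assumes the usual Snorkel-style definitions and the convention that -1 means "abstain"; the helper name lf_summary and the toy matrix are hypothetical, and the exact convention behind the values in Table 5.1 may differ.

```python
ABSTAIN = -1  # assumed convention: -1 means the label function did not vote


def lf_summary(L):
    """Return (coverage, overlaps, conflicts) for each label function.

    L is a list of rows, one per candidate; L[i][j] is the vote of LF j.
    """
    n = len(L)
    m = len(L[0])
    stats = []
    for j in range(m):
        covered = overlaps = conflicts = 0
        for row in L:
            if row[j] == ABSTAIN:
                continue
            covered += 1
            # Votes of all other label functions on the same candidate.
            others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
            if others:
                overlaps += 1
            if any(v != row[j] for v in others):
                conflicts += 1
        stats.append((covered / n, overlaps / n, conflicts / n))
    return stats


# Toy label matrix: 4 candidates, 3 label functions (illustrative only).
L = [
    [1, 1, ABSTAIN],
    [0, ABSTAIN, ABSTAIN],
    [1, 0, 1],
    [ABSTAIN, ABSTAIN, 0],
]
for j, (cov, ov, con) in enumerate(lf_summary(L)):
    print(j, cov, ov, con)
```

Under these definitions, a label function with high coverage and few conflicts, such as lf_address_extract_model, contributes the most reliable votes.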
5.3.5 Feature Extraction
After label assignment, we obtain one LabelModel per label. In the One Stage
method, this LabelModel serves directly as the final model, but prediction
then always requires running the label functions, which is problematic,
especially when a label function calls a third-party service. In this work we
use the Two Stage method instead: we train a new model on the output of the
LabelModel together with the features of the candidates, so that the end
model no longer depends on the label functions. First, we divide the existing
dataset into train, dev and test sets. See Appendix A11.
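The split step can be sketched as follows. This is only an illustration of a reproducible document-level split, not the code of Appendix A11: the function name split_dataset, the 80/10/10 proportions, the seed and the document identifiers are all assumptions made for the example.

```python
import random


def split_dataset(doc_ids, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle document ids deterministically and cut them into three sets."""
    ids = sorted(doc_ids)               # fixed starting order
    random.Random(seed).shuffle(ids)    # seeded, so the split is reproducible
    n_dev = int(len(ids) * dev_frac)
    n_test = int(len(ids) * test_frac)
    dev = ids[:n_dev]
    test = ids[n_dev:n_dev + n_test]
    train = ids[n_dev + n_test:]
    return train, dev, test


# Hypothetical document identifiers.
docs = [f"doc_{i}" for i in range(100)]
train, dev, test = split_dataset(docs)
print(len(train), len(dev), len(test))
```

Splitting by document rather than by candidate keeps all candidates of one page in the same set, which avoids leaking near-duplicate candidates from train into test.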
To extract features for the Candidates, we rely on Fonduer. The Fonduer
framework extracts features from data represented in its Data Model,
including display features, structural features and textual features. Users
only need to configure which feature types to use in the file
“fonduer-config.yaml”. See Appendix A12. In our problem, we use only
structural and textual features, because display features require
calculating the distance and size of the elements, which is not possible
without rendering the page in a browser. In the source code, we only need to
use the Featurizer to create a feature matrix for the candidates. See
Appendix A13.
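To make the notion of a feature matrix concrete, the sketch below builds a sparse binary matrix from textual and structural feature names. It does not use Fonduer and does not reproduce Appendix A13; the extractor functions, the feature names and the dictionary-based candidates are hypothetical stand-ins chosen only to show the shape of the result.

```python
def textual_features(cand):
    # Hypothetical textual features: lowercased tokens of the candidate span.
    return {f"WORD_{tok.lower()}" for tok in cand["text"].split()}


def structural_features(cand):
    # Hypothetical structural features: HTML tag and parent tag of the span.
    return {f"TAG_{cand['tag']}", f"PARENT_TAG_{cand['parent_tag']}"}


def build_feature_matrix(cands, extractors):
    """Return (rows, keys): one sparse row per candidate and the column index."""
    keys = {}   # feature name -> column index
    rows = []   # one dict {column index: 1} per candidate
    for cand in cands:
        row = {}
        for extract in extractors:
            for name in sorted(extract(cand)):
                col = keys.setdefault(name, len(keys))
                row[col] = 1
        rows.append(row)
    return rows, keys


# Two toy candidates (illustrative only).
cands = [
    {"text": "12 Nguyen Trai", "tag": "span", "parent_tag": "div"},
    {"text": "Quan 1", "tag": "p", "parent_tag": "div"},
]
F, keys = build_feature_matrix(cands, [textual_features, structural_features])
print(len(keys), F[1])
```

Leaving an extractor out of the list corresponds to disabling a feature type in the configuration, which is how the display features are excluded here.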
5.3.6 Train Final Model
In the training stage, since the task is a standard classification problem,
we use the LogisticRegression model provided by Fonduer, with the textual
and structural features described above. We then compare its predictions
with the GoldLabel annotations to evaluate the model.
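The second stage can be illustrated with a small stand-in for the end model. The sketch below is a plain-Python binary logistic regression trained on probabilistic labels, not Fonduer's LogisticRegression; the feature vectors, the soft labels (assumed to come from a label model), the gold labels and the hyperparameters are invented for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y_soft, epochs=500, lr=0.5):
    """Fit weights by gradient descent on cross-entropy against soft labels."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(X, y_soft):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for k, xi in enumerate(x):
                grad_w[k] += err * xi
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    # Only the feature vector is needed here; no label function is called.
    return int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5)


# Toy training data: binary feature vectors and assumed label-model marginals.
X_train = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
y_soft = [0.9, 0.8, 0.1, 0.2]
w, b = train_logreg(X_train, y_soft)

# Evaluation against hypothetical gold labels.
X_test = [[1, 0, 0], [0, 1, 1]]
gold = [1, 0]
preds = [predict(w, b, x) for x in X_test]
accuracy = sum(p == g for p, g in zip(preds, gold)) / len(gold)
print(preds, accuracy)
```

Because predict takes only a feature vector, the trained model can be applied to new candidates without the label functions, which is the motivation for the Two Stage method.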